682 research outputs found

    Indices and Applications in High-Throughput Sequencing

    Get PDF
    Recent advances in sequencing technology allow to produce billions of base pairs per day in the form of reads of length 100 bp an longer and current developments promise the personal $1,000 genome in a couple of years. The analysis of these unprecedented amounts of data demands for efficient data structures and algorithms. One such data structures is the substring index, that represents all substrings or substrings up to a certain length contained in a given text. In this thesis we propose 3 substring indices, which we extend to be applicable to millions of sequences. We devise internal and external memory construction algorithms and a uniform framework for accessing the generalized suffix tree. Additionally we propose different index-based applications, e.g. exact and approximate pattern matching and different repeat search algorithms. Second, we present the read mapping tool RazerS, which aligns millions of single or paired-end reads of arbitrary lengths to their potential genomic origin using either Hamming or edit distance. Our tool can work either lossless or with a user-defined loss rate at higher speeds. Given the loss rate, we present a novel approach that guarantees not to lose more reads than specified. This enables the user to adapt to the problem at hand and provides a seamless tradeoff between sensitivity and running time. We compare RazerS with other state-of-the-art read mappers and show that it has the highest sensitivity and a comparable performance on various real-world datasets. At last, we propose a general approach for frequency based string mining, which has many applications, e.g. in contrast data mining. Our contribution is a novel and lightweight algorithm that is faster and uses less memory than the best available algorithms. We show its applicability for mining multiple databases with a variety of frequency constraints. As such, we use the notion of entropy from information theory to generalize the emerging substring mining problem to multiple databases. To demonstrate the improvement of our algorithm we compared to recent approaches on real-world experiments of various string domains, e.g. natural language, DNA, or protein sequences

    Fast and accurate read mapping with approximate seeds and multiple backtracking

    Get PDF
    We present Masai, a read mapper representing the state-of-the-art in terms of speed and accuracy. Our tool is an order of magnitude faster than RazerS 3 and mrFAST, 2-4 times faster and more accurate than Bowtie 2 and BWA. The novelties of our read mapper are filtration with approximate seeds and a method for multiple backtracking. Approximate seeds, compared with exact seeds, increase filtration specificity while preserving sensitivity. Multiple backtracking amortizes the cost of searching a large set of seeds by taking advantage of the repetitiveness of next-generation sequencing data. Combined together, these two methods significantly speed up approximate search on genomic data sets. Masai is implemented in C++ using the SeqAn library. The source code is distributed under the BSD license and binaries for Linux, Mac OS X and Windows can be freely downloaded from http://www.seqan.de/projects/masai

    RazerS 3: Faster, fully sensitive read mapping

    Get PDF
    Motivation: During the last years NGS sequencing has become a key technology for many applications in the biomedical sciences. Throughput continues to increase and new protocols provide longer reads than currently available. In almost all applications, read mapping is a first step. Hence, it is crucial to have algorithms and implementations that perform fast, with high sensitivity, and are able to deal with long reads and a large absolute number of indels. Results: RazerS is a read mapping program with adjustable sensitivity based on counting q-grams. In this work we propose the successor RazerS 3 which now supports shared-memory parallelism, an additional seed-based filter with adjustable sensitivity, a much faster, banded version of the Myers’ bit-vector algorithm for verification, memory saving measures and support for the SAM output format. This leads to a much improved performance for mapping reads, in particular long reads with many errors. We extensively compare RazerS 3 with other popular read mappers and show that its results are often superior to them in terms of sensitivity while exhibiting practical and often competetive run times. In addition, RazerS 3 works without a precomputed index. Availability and Implementation: Source code and binaries are freely available for download at http://www.seqan.de/projects/razers. RazerS 3 is implemented in C++ and OpenMP under a GPL license using the SeqAn library and supports Linux, Mac OS X, and Windows

    RazerS - Fast Read Mapping with Sensitivity Control

    Get PDF
    Second-generation sequencing technologies deliver DNA sequence data at unprecedented high throughput. Common to most biological applications is a mapping of the reads to an almost identical or highly similar reference genome. Due to the large amounts of data, efficient algorithms and implementations are crucial for this task. We present an efficient read mapping tool called RazerS. It allows the user to align sequencing reads of arbitrary length using either the Hamming distance or the edit distance. Our tool can work either lossless or with a user-defined loss rate at higher speeds. Given the loss rate, we present an approach that guarantees not to lose more reads than specified. This enables the user to adapt to the problem at hand and provides a seamless tradeoff between sensitivity and running time

    Segment-based multiple sequence alignment

    Get PDF
    Motivation: Many multiple sequence alignment tools have been developed in the past, progressing either in speed or alignment accuracy. Given the importance and wide-spread use of alignment tools, progress in both categories is a contribution to the community and has driven research in the field so far. Results: We introduce a graph-based extension to the consistency-based, progressive alignment strategy. We apply the consistency notion to segments instead of single characters. The main problem we solve in this context is to define segments of the sequences in such a way that a graph-based alignment is possible. We implemented the algorithm using the SeqAn library and report results on amino acid and DNA sequences. The benefit of our approach is threefold: (1) sequences with conserved blocks can be rapidly aligned, (2) the implementation is conceptually easy, generic and fast and (3) the consistency idea can be extended to align multiple genomic sequences. Availability: The segment-based multiple sequence alignment tool can be downloaded from http://www.seqan.de/projects/msa.html. A novel version of T-Coffee interfaced with the tool is available from http://www.tcoffee.org. The usage of the tool is described in both documentations. Contact: [email protected]

    Bis(μ-dimesitylborinato-κ2 O:O)bis­[(2-methyl­pyridine-κN)lithium]

    Get PDF
    The title compound, [Li2(C18H22BO)2(C6H7N)2], is a lithium dimesitylboroxide dimer in which the lithium cation is also coordinated by one mol­ecule of 2-methyl­pyridine. At the core of the structure is an Li2O2 four-membered ring. The structure is centrosymmetric with an inversion centre midway between two Li atoms. Inter­molecular C—H⋯π inter­actions and π–π inter­actions between the 2-methyl­pyridine rings exist [centroid–centroid distance = 3.6312 (16) Å]

    Detecting genomic indel variants with exact breakpoints in single- and paired-end sequencing data using SplazerS

    Get PDF
    Motivation: The reliable detection of genomic variation in resequencing data is still a major challenge, especially for variants larger than a few base pairs. Sequencing reads crossing boundaries of structural variation carry the potential for their identification, but are difficult to map. Results: Here we present a method for ‘split’ read mapping, where prefix and suffix match of a read may be interrupted by a longer gap in the read-to-reference alignment. We use this method to accurately detect medium-sized insertions and long deletions with precise breakpoints in genomic resequencing data. Compared with alternative split mapping methods, SplazerS significantly improves sensitivity for detecting large indel events, especially in variant-rich regions. Our method is robust in the presence of sequencing errors as well as alignment errors due to genomic mutations/divergence, and can be used on reads of variable lengths. Our analysis shows that SplazerS is a versatile tool applicable to unanchored or single-end as well as anchored paired-end reads. In addition, application of SplazerS to targeted resequencing data led to the interesting discovery of a complete, possibly functional gene retrocopy variant. Availability: SplazerS is available from http://www.seqan.de/projects/ splazers

    Molecular epidemiology of methicillin-resistant Staphylococcus aureus isolated from Australian veterinarians

    Get PDF
    This work investigated the molecular epidemiology and antimicrobial resistance of methicillinresistant Staphylococcus aureus (MRSA) isolated from veterinarians in Australia in 2009. The collection (n = 44) was subjected to extensive molecular typing (MLST, spa, SCCmec, dru, PFGE, virulence and antimicrobial resistance genotyping) and antimicrobial resistance phenotyping by disk diffusion. MRSA was isolated from Australian veterinarians representing various occupational emphases. The isolate collection was dominated by MRSA strains belonging to clonal complex (CC) 8 and multilocus sequence type (ST) 22. CC8 MRSA (ST8-IV [2B], spa t064; and ST612-IV [2B] , spa variable,) were strongly associated with equine practice veterinarians (OR = 17.5, 95% CI = 3.3-92.5, P < 0.001) and were often resistant to gentamicin and rifampicin. ST22-IV [2B], spa variable, were strongly associated with companion animal practice veterinarians (OR = 52.5, 95% CI = 5.2-532.7, P < 0.001) and were resistant to ciprofloxacin. A single pig practice veterinarian carried ST398-V [5C2], spa t1451. Equine practice and companion animal practice veterinarians frequently carried multiresistant-CC8 and ST22 MRSA, respectively, whereas only a single swine specialist carried MRSA ST398. The presence of these strains in veterinarians may be associated with specific antimicrobial administration practices in each animal species
    corecore